6.5 Data Description

There are 1146 annotations in the corpus, 982 of which are fully annotated logograms. The rest are marked with the exclude flag; most of them are long transcriptions of polysyllabic signs rather than single logograms. These have been split into two or more independent logograms, and the original annotations are kept for reference. Other transcriptions, marked with the flag problem, present some graphical or representational problem that has led us to exclude them from annotation for now, although they remain in the corpus.

Within the 982 fully annotated logograms, 6060 individual graphemes can be found. Of these, 330 belong to the HEAD class, 1047 to DIAC, 1649 to HAND, 1369 to ARRO, 1292 to STEM and 373 to ARC. In Table 6.3, these numbers can be compared to the number of different SHAPEs found for each CLASS. The proportions differ widely: some classes, like ARRO or DIAC, have only a few possible grapheme SHAPEs but are abundantly represented in the data, while other classes, like HAND or ARC, are less abundant relative to the variability within the class.

Tab. 6.3 − Counts of observations in the corpus by CLASS. Unique observations are those that share the same set of features. The appearance rate is the number of graphemes of a class divided by the total number of logograms (982), measuring how many graphemes of that class appear in a logogram on average.
CLASS   Graphemes   Unique Observations   SHAPEs   Appearance Rate
HAND         1649                   560       72              1.68
ARRO         1369                    23        3              1.39
STEM         1292                    15        2              1.32
DIAC         1047                    19       19              1.07
ARC           373                    37        6              0.38
HEAD          330                    20       20              0.34
total        6060                   674      122              6.17
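
The appearance rates in Table 6.3 can be reproduced from the grapheme counts alone. The following is a minimal sketch, not the code used to build the table; the names `counts`, `N_LOGOGRAMS` and `appearance_rates` are hypothetical, and the numbers are copied from the table.

```python
# Grapheme counts per CLASS, copied from Table 6.3.
counts = {
    "HAND": 1649,
    "ARRO": 1369,
    "STEM": 1292,
    "DIAC": 1047,
    "ARC": 373,
    "HEAD": 330,
}

# Number of fully annotated logograms in the corpus.
N_LOGOGRAMS = 982


def appearance_rates(counts, n_logograms):
    """Return the mean number of graphemes of each class per logogram."""
    return {cls: n / n_logograms for cls, n in counts.items()}


rates = appearance_rates(counts, N_LOGOGRAMS)
total_graphemes = sum(counts.values())
total_rate = total_graphemes / N_LOGOGRAMS

for cls, rate in rates.items():
    print(f"{cls:5s} {rate:.2f}")
print(f"total {total_rate:.2f}")
```

Because the total rate is computed from the unrounded total, it can differ in the last digit from the sum of the rounded per-class rates.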

Examining the rate of appearance of grapheme classes per logogram, we can also make some interesting observations. The average number of graphemes per logogram is about 6, of which 1.68 are hands. Unfortunately, we cannot distinguish between bimanual signs and transcriptions where a transformation in handshape is encoded, although further semantic annotation would make this clear. The complexity of movement paths is easier to compute. Since most movements are marked with an arrow head, the ratio of segments (STEM and ARC) to arrow heads (ARRO) gives an approximate measure of the mean number of segments per path: \(\frac{1292+373}{1369} \approx 1.22\). This means that most movement markers are simple, with just one segment, but a non-trivial share (approximately one in five) is more complex, having two or more segments.
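
The "one in five" reading can be made explicit under two simplifying assumptions that the data only approximately satisfy: every path carries exactly one arrow head, and every path has at least one segment. Writing \(n_i \ge 1\) for the number of segments of path \(i\), \(N\) for the number of paths, and \(p\) for the fraction of paths with two or more segments (symbols introduced here for illustration):

```latex
\[
\bar{n} \;=\; \frac{1}{N}\sum_{i=1}^{N} n_i
        \;=\; 1 + \frac{1}{N}\sum_{i=1}^{N} (n_i - 1)
        \;\ge\; 1 + p
\quad\Longrightarrow\quad
p \;\le\; \bar{n} - 1 \;\approx\; \frac{1292 + 373}{1369} - 1 \;\approx\; 0.22
\]
```

Equality holds when no path has more than two segments, so under these assumptions roughly 22% is an upper bound on the share of multi-segment paths rather than an exact count.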

If we examine the distribution of tag combinations, however, we find it to be very skewed, as depicted in Figure 6.8. A few graphemes are very common, while most combinations are rare, forming a very long tail of infrequent graphemes. The same happens if we look only at the SHAPE feature, and across classes, as can be seen in Figure 6.9.

Fig. 6.8 − Distribution of unique tag combinations in the corpus. The most common 80 are arranged in the horizontal axis (not labeled for clarity), while the vertical axis represents the number of times that unique combination appears in the corpus. The bars are also color-coded by CLASS. The plot would extend to the right more than 6 times, making the long tail even longer and thinner.
Fig. 6.9 − Distribution of shapes for some classes. The horizontal axis represents the shapes (not labeled for clarity), the vertical axis is the number of times that shape appears in the corpus. The most common CLASS is labeled near its bar for reference.
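
The rank-frequency view behind Figures 6.8 and 6.9 can be sketched with a simple counter. This is an illustrative sketch on hypothetical toy data, not the corpus itself: the three-tag layout (CLASS, SHAPE, ROTATION), the tag values, and the helper names `rank_frequency` and `head_coverage` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy data: each grapheme is a (CLASS, SHAPE, ROTATION) tuple.
# Tag names and values are illustrative, not taken from the corpus.
graphemes = [
    ("DIAC", "touch", "N"), ("DIAC", "touch", "N"), ("DIAC", "touch", "N"),
    ("HAND", "flat", "N"), ("HAND", "flat", "E"), ("HAND", "fist", "N"),
    ("ARRO", "full", "E"), ("ARRO", "full", "E"), ("STEM", "straight", "E"),
]


def rank_frequency(items):
    """Return (item, count) pairs sorted from most to least frequent."""
    return Counter(items).most_common()


def head_coverage(ranked, k):
    """Fraction of all observations covered by the k most frequent items."""
    total = sum(count for _, count in ranked)
    return sum(count for _, count in ranked[:k]) / total


# Distribution over full tag combinations (as in Figure 6.8).
by_combination = rank_frequency(graphemes)

# Distribution over (CLASS, SHAPE) only (as in Figure 6.9).
by_shape = rank_frequency((cls, shape) for cls, shape, _ in graphemes)

print(by_combination[0])
print(head_coverage(by_combination, 2))
```

On a skewed distribution, `head_coverage` grows quickly for small `k` and then flattens, which is the numerical counterpart of the long tail seen in the plots.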

Since this is not a corpus of real utterances or texts, but rather a vocabulary, conclusions about the frequency of elements in SignWriting in use cannot be drawn directly. However, we can make observations across the vocabulary of the transcribed sign language, where some particular gestures (understood in the broad linguistic sense) are much more common than others. For example, the “touch” grapheme is extremely common, as is the PICAM- hand shape (fingers extended but together), which acts as the “flat” object descriptor in LSE (top left in Table 6.1).